feat: implement self-healing recovery mechanism for gaps in L1 data#403
feat: implement self-healing recovery mechanism for gaps in L1 data #403 — jonastheis wants to merge 45 commits into main from
Conversation
CodSpeed Performance Report: Merging #403 will not alter performance. Comparing Summary
|
… blocks if the send channel is full
…in ChainOrchestrator
frisitano
left a comment
There was a problem hiding this comment.
Added some comments inline.
| // testing | ||
| #[cfg(feature = "test-utils")] | ||
| { | ||
| let (tx, rx) = tokio::sync::mpsc::channel(1000); |
There was a problem hiding this comment.
Can we create an L1 watcher handle and receiver channel here that can be used for testing?
crates/database/db/src/operations.rs
Outdated
| async fn get_batch_by_index( | ||
| &self, | ||
| batch_index: u64, | ||
| processed: Option<bool>, |
There was a problem hiding this comment.
What's the purpose of adding the processed filter?
There was a problem hiding this comment.
Thought I needed it at some point. reverted.
crates/watcher/src/handle/command.rs
Outdated
| /// New sender to replace the current notification channel | ||
| new_sender: mpsc::Sender<Arc<L1Notification>>, | ||
| /// Oneshot sender to signal completion of the reset operation | ||
| response_sender: oneshot::Sender<()>, |
There was a problem hiding this comment.
not really needed. removed.
crates/watcher/src/handle/mod.rs
Outdated
| /// This trait allows the chain orchestrator to send commands to the L1 watcher, | ||
| /// primarily for gap recovery scenarios. | ||
| #[async_trait::async_trait] | ||
| pub trait L1WatcherHandleTrait: Send + Sync + 'static { |
There was a problem hiding this comment.
What value does a trait add here as opposed to using a concrete type? Do we intend to have some sort of genericness on the handle?
There was a problem hiding this comment.
yeah thought I'd need it for testing. removed now.
crates/watcher/src/handle/mod.rs
Outdated
| pub struct MockL1WatcherHandle { | ||
| /// Track reset calls as (`block_number`, `channel_capacity`) | ||
| resets: Arc<std::sync::Mutex<Vec<(u64, usize)>>>, | ||
| } |
There was a problem hiding this comment.
Why do we need this? Can't we just inspect the receiver channel directly? I think we would then be able to remove MockL1WatcherHandle and the L1WatcherHandleTrait and just use the L1WatcherHandle directly. I think this would result in simpler code.
There was a problem hiding this comment.
yeah thought I'd need it for testing. removed now.
crates/chain-orchestrator/src/lib.rs
Outdated
| /// A receiver for [`L1Notification`]s from the [`rollup_node_watcher::L1Watcher`]. | ||
| l1_notification_rx: Receiver<Arc<L1Notification>>, | ||
| /// Handle to send commands to the L1 watcher (e.g., for gap recovery). | ||
| l1_watcher_handle: Option<H>, |
There was a problem hiding this comment.
removed the option
crates/chain-orchestrator/src/lib.rs
Outdated
| ) { | ||
| Err(ChainOrchestratorError::L1MessageQueueGap(queue_index)) => { | ||
| // Query database for the L1 block of the last known L1 message | ||
| let reset_block = | ||
| self.database.get_last_l1_message_l1_block().await?.unwrap_or(0); | ||
| // TODO: handle None case (no messages in DB) | ||
|
|
||
| tracing::warn!( | ||
| target: "scroll::chain_orchestrator", | ||
| "L1 message queue gap detected at index {}, last known message at L1 block {}", | ||
| queue_index, | ||
| reset_block | ||
| ); | ||
|
|
||
| // Trigger gap recovery | ||
| self.trigger_gap_recovery(reset_block, "L1 message queue gap").await?; | ||
|
|
||
| // Return no event, recovery will re-process | ||
| Ok(None) | ||
| } | ||
| Err(ChainOrchestratorError::DuplicateL1Message(queue_index)) => { | ||
| tracing::info!( | ||
| target: "scroll::chain_orchestrator", | ||
| "Duplicate L1 message detected at {:?}, skipping", | ||
| queue_index | ||
| ); | ||
| // Return no event, as the message has already been processed | ||
| Ok(None) | ||
| } | ||
| result => result, | ||
| } |
There was a problem hiding this comment.
Why don't we embed this logic inside of handle_l1_message?
crates/chain-orchestrator/src/lib.rs
Outdated
| match metered!(Task::BatchCommit, self, handle_batch_commit(batch.clone())) { | ||
| Err(ChainOrchestratorError::BatchCommitGap(batch_index)) => { | ||
| // Query database for the L1 block of the last known batch | ||
| let reset_block = | ||
| self.database.get_last_batch_commit_l1_block().await?.unwrap_or(0); | ||
| // TODO: handle None case (no batches in DB) | ||
|
|
||
| tracing::warn!( | ||
| target: "scroll::chain_orchestrator", | ||
| "Batch commit gap detected at index {}, last known batch at L1 block {}", | ||
| batch_index, | ||
| reset_block | ||
| ); | ||
|
|
||
| // Trigger gap recovery | ||
| self.trigger_gap_recovery(reset_block, "batch commit gap").await?; | ||
|
|
||
| // Return no event, recovery will re-process | ||
| Ok(None) | ||
| } | ||
| Err(ChainOrchestratorError::DuplicateBatchCommit(batch_info)) => { | ||
| tracing::info!( | ||
| target: "scroll::chain_orchestrator", | ||
| "Duplicate batch commit detected at {:?}, skipping", | ||
| batch_info | ||
| ); | ||
| // Return no event, as the batch has already been processed | ||
| Ok(None) | ||
| } | ||
| result => result, | ||
| } |
There was a problem hiding this comment.
Why don't we embed this logic in handle_batch_commit?
crates/chain-orchestrator/src/lib.rs
Outdated
| /// # Arguments | ||
| /// * `reset_block` - The L1 block number to reset to (last known good state) | ||
| /// * `gap_type` - Description of the gap type for logging | ||
| async fn trigger_gap_recovery( |
There was a problem hiding this comment.
If we embed the L1Notification channel inside of the L1WatcherHandle then we can implement this logic on the L1WatcherHandle directly enabling better encapsulation.
…aling-l1-events Conflicts: crates/chain-orchestrator/src/lib.rs crates/node/src/args.rs crates/watcher/src/lib.rs crates/watcher/tests/indexing.rs crates/watcher/tests/logs.rs crates/watcher/tests/reorg.rs
…s with L1 message queue hash calculation
| } | ||
|
|
||
| // Check if batch already exists in DB. | ||
| for existing_batch in tx.get_batch_by_index(batch.index).await? { |
There was a problem hiding this comment.
@frisitano is this for loop maybe wrong? It may result in a loop while holding the lock.
| // This means we have already processed this batch commit, we will skip | ||
| // it. | ||
| return Ok(Some(BatchCommitDuplicate(existing_batch.index))); | ||
| } else if existing_batch.reverted_block_number.is_none() { |
There was a problem hiding this comment.
| } else if existing_batch.reverted_block_number.is_none() { | |
| } else if existing_batch.reverted_block_number.is_some() { |
??
|
Closing as stale. |
This PR implements a self-healing gap recovery mechanism for L1 messages and batch events. The actual gap detection happens in the
ChainOrchestrator, which subsequently notifies the L1Watcher that it needs to reset. Specifically, the following changes are implemented:
Fixes: #328, #235